Computation and Language 54
☆ SparseOptimizer: Sparsify Language Models through Moreau-Yosida Regularization and Accelerate through Compiler Co-design
This paper introduces SparseOptimizer, a novel deep learning optimizer that
exploits Moreau-Yosida regularization to naturally induce sparsity in large
language models such as BERT, ALBERT and GPT. Key to the design of
SparseOptimizer is an embedded shrinkage operator, which imparts sparsity
directly within the optimization process. This operator, backed by a sound
theoretical framework, includes an analytical solution, thereby reinforcing the
optimizer's robustness and efficacy. Crucially, SparseOptimizer's plug-and-play
functionality eradicates the need for code modifications, making it a
universally adaptable tool for a wide array of large language models. Empirical
evaluations on benchmark datasets such as GLUE, RACE, SQuAD1, and SQuAD2
confirm that SparseBERT and SparseALBERT, when sparsified using
SparseOptimizer, achieve performance comparable to their dense counterparts,
BERT and ALBERT, while significantly reducing their parameter count. Further,
this work proposes an innovative optimizer-compiler co-design strategy,
demonstrating the potential of inference acceleration (3.37x, 6.30x, and
7.15x in comparison with PyTorch, TensorFlow, and LLVM generic compilation,
respectively) in SparseBERT when paired with an
appropriately designed compiler. This study represents a significant step
forward in the evolution of efficient, scalable, and high-performing large
language models, setting a precedent for future exploration and optimization in
this domain. The SparseOptimizer code and SparseALBERT model will be made
available upon paper acceptance.
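As a rough illustration of what an embedded shrinkage operator does, the sketch below applies elementwise soft-thresholding after a plain gradient step. Soft-thresholding is the classical closed-form proximal operator of the L1 norm and is used here as an assumption: the abstract does not specify SparseOptimizer's exact operator, and the names `soft_threshold`, `sparse_step`, `lr`, and `lam` are hypothetical.

```python
def soft_threshold(w, threshold):
    """Elementwise soft-thresholding: shrink each weight toward zero by `threshold`."""
    out = []
    for x in w:
        if x > threshold:
            out.append(x - threshold)
        elif x < -threshold:
            out.append(x + threshold)
        else:
            out.append(0.0)  # weights within the threshold become exactly zero
    return out


def sparse_step(w, grad, lr=0.1, lam=0.5):
    """One proximal-gradient step: gradient descent, then shrinkage.

    Hypothetical sketch for illustration, not the SparseOptimizer update rule.
    """
    stepped = [x - lr * g for x, g in zip(w, grad)]
    return soft_threshold(stepped, lr * lam)


weights = [0.8, -0.03, 0.02, -1.2]
grads = [0.0, 0.0, 0.0, 0.0]
new_weights = sparse_step(weights, grads)
```

Because the shrinkage runs inside the update step, small weights are set exactly to zero during training rather than pruned afterwards.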
☆ Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos
Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
To realize human-robot collaboration, robots need to execute actions for new
tasks according to human instructions given finite prior knowledge. Human
experts can share their knowledge of how to perform a task with a robot through
multi-modal instructions in their demonstrations, showing a sequence of
short-horizon steps to achieve a long-horizon goal. This paper introduces a
method for robot action sequence generation from instruction videos using (1)
an audio-visual Transformer that converts audio-visual features and instruction
speech to a sequence of robot actions called dynamic movement primitives (DMPs)
and (2) style-transfer-based training that employs multi-task learning with
video captioning and weakly-supervised learning with a semantic classifier to
exploit unpaired video-action data. We built a system that accomplishes various
cooking actions, where an arm robot executes a DMP sequence acquired from a
cooking video using the audio-visual Transformer. Experiments with
Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets
show that the proposed method improves the quality of DMP sequences by 2.3
times the METEOR score obtained with a baseline video-to-action Transformer.
The model achieved a 32% task success rate given task knowledge of the
object.
comment: Accepted to Interspeech2023
☆ Automatic Annotation of Direct Speech in Written French Narratives ACL 2023
The automatic annotation of direct speech (AADS) in written text has often
been used in computational narrative understanding. Methods based on either
rules or deep neural networks have been explored, in particular for English or
German. Yet, for French, our target language, few works exist.
Our goal is to create a unified framework to design and evaluate AADS models in
French. For this, we consolidated the largest-to-date French narrative dataset
annotated with DS per word; we adapted various baselines for sequence labelling
or from AADS in other languages; and we designed and conducted an extensive
evaluation focused on generalisation. Results show that the task still requires
substantial effort and highlight the characteristics of each baseline. Although
this framework could be improved, it is a step toward encouraging more
research on the topic.
comment: 9 pages, ACL 2023
☆ Constructing Multilingual Code Search Dataset Using Neural Machine Translation ACL2023
Code search is the task of finding program code that semantically matches a
given natural language query. Even though some of the existing datasets for
this task are multilingual on the programming language side, their query data
are only in English. In this research, we create a multilingual code search
dataset in four natural and four programming languages using a neural machine
translation model. Using our dataset, we pre-train and fine-tune
Transformer-based models and then evaluate them on multiple code search test
sets. Our results show that the model pre-trained with all natural and
programming language data performs best in most cases. By applying
back-translation data filtering to our dataset, we demonstrate that the
translation quality affects the model's performance to a certain extent, but
the data size matters more.
comment: To appear in the Proceedings of the ACL2023 Student Research Workshop
(SRW)
☆ Extending Context Window of Large Language Models via Positional Interpolation
We present Position Interpolation (PI) that extends the context window sizes
of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal
fine-tuning (within 1000 steps), while demonstrating strong empirical results
on various tasks that require long context, including passkey retrieval,
language modeling, and long document summarization from LLaMA 7B to 65B.
Meanwhile, models extended by Position Interpolation preserve quality
relatively well on tasks within their original context window. To achieve this
goal, Position Interpolation linearly down-scales the input position indices to
match the original context window size, rather than extrapolating beyond the
trained context length which may lead to catastrophically high attention scores
that completely ruin the self-attention mechanism. Our theoretical study shows
that the upper bound of interpolation is at least $\sim 600 \times$ smaller
than that of extrapolation, further demonstrating its stability. Models
extended via Position Interpolation retain their original architecture and can
reuse most pre-existing optimization and infrastructure.
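The linear down-scaling of position indices can be sketched in a few lines. This is a minimal illustration under stated assumptions: the window sizes are illustrative, the rotary base of 10000 is the commonly used RoPE default, and the function names `rope_angles` and `interpolate_position` are hypothetical rather than taken from the paper's code.

```python
def rope_angles(position, dim, base=10000.0):
    """Rotary angles position * base**(-2i/dim) for each frequency pair i."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def interpolate_position(position, original_len, extended_len):
    """Linearly down-scale a position index into the original context range."""
    return position * original_len / extended_len


original_len, extended_len = 2048, 8192   # illustrative sizes (assumed)
pos = 6000                                # lies beyond the original window
scaled = interpolate_position(pos, original_len, extended_len)
angles = rope_angles(scaled, dim=8)
```

Every scaled index stays below `original_len`, so the rotary angles remain inside the range seen during pre-training instead of being extrapolated.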
☆ CrunchGPT: A chatGPT assisted framework for scientific machine learning
Scientific Machine Learning (SciML) has advanced recently across many
different areas in computational science and engineering. The objective is to
integrate data and physics seamlessly without the need to employ elaborate
and computationally taxing data assimilation schemes. However, preprocessing,
problem formulation, code generation, postprocessing and analysis are still
time consuming and may prevent SciML from wide applicability in industrial
applications and in digital twin frameworks. Here, we integrate the various
stages of SciML under the umbrella of ChatGPT, to formulate CrunchGPT, which
plays the role of a conductor orchestrating the entire workflow of SciML based
on simple prompts by the user. Specifically, we present two examples that
demonstrate the potential use of CrunchGPT in optimizing airfoils in
aerodynamics, and in obtaining flow fields in various geometries in interactive
mode, with emphasis on the validation stage. To demonstrate the flow of
CrunchGPT and create an infrastructure that can facilitate a broader vision,
we built a webapp-based guided user interface that includes options for a
comprehensive summary report. The overall objective is to extend CrunchGPT to
handle diverse problems in computational mechanics, design, optimization and
controls, and general scientific computing tasks involved in SciML, so that it
can serve both as a research assistant and as an educational tool. While the
examples here focus on fluid mechanics, future versions will target solid mechanics
and materials science, geophysics, systems biology and bioinformatics.
comment: 20 pages, 26 figures
☆ CamemBERT-bio: a Tasty French Language Model Better for your Health
Clinical data in hospitals are increasingly accessible for research through
clinical data warehouses; however, these documents are unstructured. It is
therefore necessary to extract information from medical reports to conduct
clinical studies. Transfer learning with BERT-like models such as CamemBERT has
allowed major advances, especially for named entity recognition. However, these
models are trained for plain language and are less efficient on biomedical
data. This is why we propose a new French public biomedical dataset on which we
have continued the pre-training of CamemBERT. Thus, we introduce a first
version of CamemBERT-bio, a specialized public model for the French biomedical
domain that shows 2.54 points of F1 score improvement on average on different
biomedical named entity recognition tasks.
☆ Unleashing the Power of User Reviews: Exploring Airline Choices at Catania Airport, Italy
This study uses new tools to investigate the possible relationship between the
mechanisms of social influence and the choice of airline, with the aim of
determining whether such tools can contribute to a better understanding of the
factors influencing the decisions of consumers in
the aviation sector. We have chosen to extract user reviews from well-known
platforms: Trustpilot, Google, and Twitter. By combining web scraping
techniques, we have been able to collect a comprehensive dataset comprising a
wide range of user opinions, feedback, and ratings. We then refined the BERT
model to focus on insightful sentiment in the context of airline reviews.
Through our analysis, we observed an intriguing trend of average negative
sentiment scores across various airlines, giving us deeper insight into the
dynamics between airlines and helping us identify key partnerships, popular
routes, and airlines that play a central role in the aeronautical ecosystem of
Catania airport during the specified period. Our investigation led us to find
that, despite an airline having received prestigious awards as a low-cost
leader in Europe for two consecutive years, 2021 and 2022, the "Catanese" user
tends to suffer the dominant position of other companies. Understanding the
impact of positive reviews and leveraging sentiment analysis can help airlines
improve their reputation, attract more customers, and ultimately gain a
competitive edge in the marketplace.
comment: arXiv admin note: text overlap with arXiv:1311.3475 by other authors
☆ Paradigm Shift in Sustainability Disclosure Analysis: Empowering Stakeholders with CHATREPORT, a Language Model-Based Tool
Jingwei Ni, Julia Bingler, Chiara Colesanti-Senni, Mathias Kraus, Glen Gostlow, Tobias Schimanski, Dominik Stammbach, Saeid Ashraf Vaghefi, Qian Wang, Nicolas Webersinke, Tobias Wekhof, Tingyu Yu, Markus Leippold
This paper introduces a novel approach to enhance Large Language Models
(LLMs) with expert knowledge to automate the analysis of corporate
sustainability reports by benchmarking them against the Task Force for
Climate-Related Financial Disclosures (TCFD) recommendations. Corporate
sustainability reports are crucial in assessing organizations' environmental
and social risks and impacts. However, the vast amount of information in these
reports often makes human analysis too costly. As a result, only a few
entities worldwide have the resources to analyze these reports, which could
lead to a lack of transparency. While AI-powered tools can automatically
analyze the data, they are prone to inaccuracies as they lack domain-specific
expertise. This paper introduces a novel approach to enhance LLMs with expert
knowledge to automate the analysis of corporate sustainability reports. We
christen our tool CHATREPORT, and apply it in a first use case to assess
corporate climate risk disclosures following the TCFD recommendations.
CHATREPORT results from collaborating with experts in climate science, finance,
economic policy, and computer science, demonstrating how domain experts can be
involved in developing AI tools. We make our prompt templates, generated data,
and scores available to the public to encourage transparency.
comment: This is a working paper
☆ Using Large Language Models to Provide Explanatory Feedback to Human Tutors
Jionghao Lin, Danielle R. Thomas, Feifei Han, Shivang Gupta, Wei Tan, Ngoc Dang Nguyen, Kenneth R. Koedinger
Research demonstrates that learners engaging in the process of producing
explanations to support their reasoning can have a positive impact on
learning. However, providing learners with real-time explanatory feedback often
presents challenges related to classification accuracy, particularly in
domain-specific environments containing situationally complex and nuanced
responses. We present two approaches for supplying tutors with real-time feedback
within an online lesson on how to give students effective praise. This
work-in-progress demonstrates considerable accuracy in binary classification
for corrective feedback of effective, or effort-based (F1 score = 0.811), and
ineffective, or outcome-based (F1 score = 0.350), praise responses. More
notably, we report progress towards an enhanced approach to providing
explanatory feedback using large language model-facilitated named entity
recognition, which can not only provide tutors with feedback while they engage
in lessons but can also potentially suggest real-time tutor moves. Future work
involves leveraging large language models for data augmentation to improve
accuracy, while also developing an explanatory feedback interface.
comment: 12 pages. Workshop paper, The 24th International Conference on
Artificial Intelligence in Education, AIED 2023. Keywords: Educational Dialogue Act
Classification, Large Language Models, Named Entity Recognition, Tutor
Training, Explanatory Feedback, Natural Language Processing
☆ KnowPrefix-Tuning: A Two-Stage Prefix-Tuning Framework for Knowledge-Grounded Dialogue Generation ECML-PKDD 2023
Existing knowledge-grounded conversation systems typically generate responses
in a retrieve-then-generate manner. They require a large knowledge base and a
strong knowledge retrieval component, which is time- and resource-consuming. In
this paper, we address the challenge by leveraging the inherent knowledge
encoded in the pre-trained language models (PLMs). We propose Knowledgeable
Prefix Tuning (KnowPrefix-Tuning), a two-stage tuning framework, bypassing the
retrieval process in a knowledge-grounded conversation system by injecting
prior knowledge into the lightweight knowledge prefix. The knowledge prefix is
a sequence of continuous knowledge-specific vectors that can be learned during
training. In addition, we propose a novel interactive re-parameterization
mechanism that allows the prefix to interact fully with the PLM during the
optimization of response generation. Experimental results demonstrate that
KnowPrefix-Tuning outperforms fine-tuning and other lightweight tuning
approaches, and performs comparably with strong retrieval-based baselines while
being $3\times$ faster during inference.
comment: Accepted by ECML-PKDD 2023 (Research Track)
☆ Phase Space Analysis of Cardiac Spectra
Cardiac diseases are one of the main causes of mortality in modern,
industrialized societies, and they impose high costs on public health
systems. Therefore, it is important to develop analytical methods to improve
cardiac diagnostics. The electrical activity of the heart was first modeled using a
set of nonlinear differential equations. Later, variations of cardiac spectra
originating from deterministic dynamics were investigated. Analysis of the power
spectra of a normal human heart reveals the His-Purkinje network, which possesses a
fractal-like structure. Phase space trajectories are extracted from the time
series graph of the ECG. Lower values of the fractal dimension D indicate dynamics
that are more coherent. D takes non-integer values greater than two when the
system becomes chaotic or exhibits a strange attractor. Recently, the development of a
fast and robust method, which can be applied to multichannel physiologic
signals, was reported. This manuscript investigates two different ECG systems
produced from normal and abnormal human hearts to introduce an auxiliary phase
space method in conjunction with ECG signals for the diagnosis of heart diseases.
Here, the data for each person include two signals, based on V_4 and modified
lead III (MLIII), respectively. A fractal analysis method is applied to the
trajectories constructed in phase space, from which the fractal dimension D is
obtained using the box counting method. It is observed that MLIII signals have
larger D values than the V_4 signals, suggesting more randomness yet
more information. The lowest value of D (1.708) indicates the perfect
oscillation of the normal heart, and the highest value of D (1.863) reflects the
randomness of the abnormal heart. Our key finding is that the phase
space picture presents the distribution of the peak heights from the ECG
spectra, giving valuable information about heart activity in conjunction with
the ECG.
comment: 10 pages, 8 figures, 1 table. arXiv admin note: text overlap with
arXiv:2305.10450
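The box counting procedure mentioned in the abstract can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' pipeline: the phase space is built with a simple two-dimensional delay embedding, the input is a synthetic sine wave standing in for an ECG trace, and the names `delay_embed`, `box_count`, and `box_counting_dimension` are hypothetical.

```python
import math


def delay_embed(signal, lag):
    """Build 2-D phase space points (x[t], x[t + lag]) from a 1-D series."""
    return [(signal[t], signal[t + lag]) for t in range(len(signal) - lag)]


def box_count(points, box_size):
    """Count the distinct grid boxes of side `box_size` that contain a point."""
    return len({(math.floor(x / box_size), math.floor(y / box_size))
                for x, y in points})


def box_counting_dimension(points, box_sizes):
    """Estimate D as the least-squares slope of log N(eps) against log(1/eps)."""
    xs = [math.log(1.0 / eps) for eps in box_sizes]
    ys = [math.log(box_count(points, eps)) for eps in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic stand-in for an ECG trace (assumed; not data from the paper).
signal = [math.sin(2 * math.pi * t / 2000) for t in range(4000)]
points = delay_embed(signal, lag=500)   # quarter-period lag traces a circle
dimension = box_counting_dimension(points, [0.2, 0.1, 0.05])
```

For a smooth closed curve such as this one, the estimate should come out close to 1. Values between 1 and 2, like those reported in the abstract, would correspond to trajectories that fill the plane more densely.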
☆ Quality Estimation of Machine Translated Texts based on Direct Evidence from Training Data
Current Machine Translation systems achieve very good results on a growing
variety of language pairs and data sets. However, it is now well known that
they produce fluent translation outputs that can often contain important
meaning errors. The Quality Estimation task deals with estimating the quality of
translations produced by a Machine Translation system without depending on
reference translations. A number of approaches have been suggested over the
years. In this paper, we show that the parallel corpus used as training data for
the MT system holds direct clues for estimating the quality of
translations produced by the MT system. Our experiments show that this simple
and direct method holds promise for quality estimation of translations produced
by any purely data driven machine translation system.
☆ Exploiting Pseudo Future Contexts for Emotion Recognition in Conversations
With the extensive accumulation of conversational data on the Internet,
emotion recognition in conversations (ERC) has received increasing attention.
Previous efforts on this task mainly focus on leveraging contextual and
speaker-specific features, or integrating heterogeneous external commonsense
knowledge. Among them, some heavily rely on future contexts, which, however,
are not always available in real-life scenarios. This fact inspires us to
generate pseudo future contexts to improve ERC. Specifically, for an utterance,
we generate its future context with pre-trained language models, potentially
containing extra beneficial knowledge in a conversational form homogeneous with
the historical ones. These characteristics make pseudo future contexts easily
fused with historical contexts and historical speaker-specific contexts,
yielding a conceptually simple framework systematically integrating
multi-contexts. Experimental results on four ERC datasets demonstrate our
method's superiority. Further in-depth analyses reveal that pseudo future
contexts can rival real ones to some extent, especially in relatively
context-independent conversations.
comment: 15 pages, accepted by ADMA 2023
☆ The Architecture of a Biologically Plausible Language Organ
We present a simulated biologically plausible language organ, made up of
stylized but realistic neurons, synapses, brain areas, plasticity, and a
simplified model of sensory perception. We show through experiments that this
model succeeds in an important early step in language acquisition: the learning
of nouns, verbs, and their meanings, from the grounded input of only a modest
number of sentences. Learning in this system is achieved through Hebbian
plasticity, and without backpropagation. Our model goes beyond a parser
previously designed in a similar environment, with the critical addition of a
biologically plausible account for how language can be acquired in the infant's
brain, not just processed by a mature brain.
comment: 9 pages, 4 figures
☆ 3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement
Disentangling uncorrelated information in speech utterances is a crucial
research topic within the speech community. Different speech-related tasks focus on
extracting distinct speech representations while minimizing the effects of
other uncorrelated information. We present a large-scale speech corpus to
facilitate research on speech representation disentanglement. 3D-Speaker
contains over 10,000 speakers, each of whom is simultaneously recorded by
multiple Devices located at different Distances, and some speakers
speak multiple Dialects. The controlled combinations of multi-dimensional
audio data yield a matrix of a diverse blend of speech representation
entanglement, thereby motivating intriguing methods to untangle them. The
multi-domain nature of 3D-Speaker also makes it a suitable resource to evaluate
large universal speech models and to experiment with methods for out-of-domain learning
and self-supervised learning. https://3dspeaker.github.io/
☆ Understanding Client Reactions in Online Mental Health Counseling ACL 2023
Communication success relies heavily on reading participants' reactions. Such
feedback is especially important for mental health counselors, who must
carefully consider the client's progress and adjust their approach accordingly.
However, previous NLP research on counseling has mainly focused on studying
counselors' intervention strategies rather than their clients' reactions to the
intervention. This work aims to fill this gap by developing a theoretically
grounded annotation framework that encompasses counselors' strategies and
client reaction behaviors. The framework has been tested against a large-scale,
high-quality text-based counseling dataset we collected over the past two years
from an online welfare counseling platform. Our study shows how clients react
to counselors' strategies, how such reactions affect the final counseling
outcomes, and how counselors can adjust their strategies in response to these
reactions. We also demonstrate that this study can help counselors
automatically predict their clients' states.
comment: Accepted to ACL 2023, oral. For code and data, see
https://github.com/dll-wu/Client-React
☆ Gender Bias in BERT -- Measuring and Analysing Biases through Sentiment Rating in a Realistic Downstream Classification Task
Pretrained language models are publicly available and constantly finetuned
for various real-life applications. As they become capable of grasping complex
contextual information, harmful biases are likely increasingly intertwined with
those models. This paper analyses gender bias in BERT models with two main
contributions: First, a novel bias measure is introduced, defining biases as
the difference in sentiment valuation of female and male sample versions.
Second, we comprehensively analyse BERT's biases on the example of a realistic
IMDB movie classifier. By systematically varying elements of the training
pipeline, we can draw conclusions regarding their impact on the final model bias. Seven
different public BERT models in nine training conditions, i.e. 63 models in
total, are compared. Almost all conditions yield significant gender biases.
Results indicate that reflected biases stem from public BERT models rather than
task-specific data, emphasising the weight of responsible usage.
☆ IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning ACL 2023
In the field of machine reading comprehension (MRC), existing systems have
surpassed the average performance of human beings in many tasks like SQuAD.
However, there is still a long way to go when it comes to logical reasoning.
Although some methods for it have been put forward, they are either designed in
quite a complicated way or rely too much on external structures. In this paper,
we propose IDOL (InDicator-Oriented Logic Pre-training), an easy-to-understand
but highly effective further pre-training task which logically strengthens the
pre-trained models with the help of 6 types of logical indicators and a
logically rich dataset LGP (LoGic Pre-training). IDOL achieves state-of-the-art
performance on ReClor and LogiQA, the two most representative benchmarks in
logical reasoning MRC, and is proven to be capable of generalizing to different
pre-trained models and other types of MRC benchmarks like RACE and SQuAD 2.0
while keeping competitive general language understanding ability through
testing on tasks in GLUE. Moreover, at the beginning of the era of large
language models, we compare against several of them, such as ChatGPT, and find
that IDOL still shows an advantage.
comment: Accepted to the Findings of ACL 2023
☆ Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?
For Pretrained Language Models (PLMs), their susceptibility to noise has
recently been linked to subword segmentation. However, it is unclear which
aspects of segmentation affect their understanding. This study assesses the
robustness of PLMs against various disrupted segmentations caused by noise. An
evaluation framework for subword segmentation, named Contrastive Lexical
Semantic (CoLeS) probe, is proposed. It provides a systematic categorization of
segmentation corruption under noise and evaluation protocols by generating
contrastive datasets with canonical-noisy word pairs. Experimental results
indicate that PLMs are unable to accurately compute word meanings if the noise
introduces completely different subwords, small subword fragments, or a large
number of additional subwords, particularly when they are inserted within other
subwords.
☆ A Survey on Out-of-Distribution Evaluation of Neural NLP Models
Adversarial robustness, domain generalization and dataset biases are three
active lines of research contributing to out-of-distribution (OOD) evaluation
on neural NLP models. However, a comprehensive, integrated discussion of the
three research lines is still lacking in the literature. In this survey, we 1)
compare the three lines of research under a unifying definition; 2) summarize
the data-generating processes and evaluation protocols for each line of
research; and 3) emphasize the challenges and opportunities for future work.
☆ GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 CVPR 2023
Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou
In this report, we present our champion solution for the Ego4D Natural Language
Queries (NLQ) Challenge at CVPR 2023. Essentially, to accurately ground in a
video, an effective egocentric feature extractor and a powerful grounding model
are required. Motivated by this, we leverage a two-stage pre-training strategy
to train egocentric feature extractors and the grounding model on video
narrations, and further fine-tune the model on annotated data. In addition, we
introduce a novel grounding model GroundNLQ, which employs a multi-modal
multi-scale grounding module for effective video and text fusion and various
temporal intervals, especially for long videos. On the blind test set,
GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively,
and surpasses all other teams by a noticeable margin. Our code will be released
at https://github.com/houzhijian/GroundNLQ.
comment: 5 pages, 2 figures, 4 tables, the champion solution for Ego4D Natural
Language Queries Challenge in CVPR 2023
☆ MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation
Humans talk in free-form while negotiating the expressed meanings or common
ground. Despite the impressive conversational abilities of the large generative
language models, they do not consider the individual differences in contextual
understanding in a shared situated environment. In this work, we propose
MindDial, a novel conversational framework that can generate situated free-form
responses to negotiate common ground. We design an explicit mind module that
can track three-level beliefs -- the speaker's belief, the speaker's prediction
of the listener's belief, and the common belief based on the gap between the
first two. Then the speaking act classification head decides whether to continue
talking, end the turn, or take a task-related action. We augment MutualFriend, a
common ground alignment dataset whose goal is to find a single mutual friend
based on free chat between two agents, with belief dynamics annotations.
Experiments show that our model with mental state modeling can resemble
human responses when aligning common ground while mimicking the natural human
conversation flow. The ablation study further validates that the third-level common
belief can aggregate information from the first- and second-order beliefs and
align common ground more efficiently.
☆ C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation ACL2023
Existing reference-free turn-level evaluation metrics for chatbots
inadequately capture the interaction between the user and the system.
Consequently, they often correlate poorly with human evaluations. To address
this issue, we propose a novel model-agnostic approach that leverages
Conditional Pointwise Mutual Information (C-PMI) to measure the turn-level
interaction between the system and the user based on a given evaluation
dimension. Experimental results on the widely used FED dialogue evaluation
dataset demonstrate that our approach significantly improves the correlation
with human judgment compared with existing evaluation systems. By replacing the
negative log-likelihood-based scorer with our proposed C-PMI scorer, we achieve
a relative 60.5% higher Spearman correlation on average for the FED evaluation
metric. Our code is publicly available at https://github.com/renll/C-PMI.
comment: Presented at ACL2023 DiaDoc Workshop
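For readers unfamiliar with the quantity, conditional pointwise mutual information can be computed from two log-probabilities, as in the sketch below. The general definition PMI(x; y | z) = log p(x | y, z) - log p(x | z) is standard. How the paper maps x, y, and z onto dialogue turns and evaluation dimensions is not stated in the abstract, so the mapping in the comments and the probability values are assumptions made for illustration.

```python
import math


def conditional_pmi(logp_x_given_yz, logp_x_given_z):
    """PMI(x; y | z) = log p(x | y, z) - log p(x | z)."""
    return logp_x_given_yz - logp_x_given_z


# Hypothetical probabilities a language model might assign to a user turn x,
# with and without the system response y, given the dialogue history z.
logp_with_response = math.log(0.20)
logp_without_response = math.log(0.05)
score = conditional_pmi(logp_with_response, logp_without_response)
```

A positive score means that conditioning on y makes x more likely than it would be from z alone, while a score of zero means y carries no additional information about x.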
☆ Emulating Reader Behaviors for Fake News Detection
The wide dissemination of fake news has affected our lives in many aspects,
making fake news detection important and attracting increasing attention.
Existing approaches make substantial contributions in this field by modeling
news from a single-modal or multi-modal perspective. However, these modal-based
methods can result in sub-optimal outcomes as they ignore reader behaviors in
news consumption and authenticity verification. For instance, they have not
taken into consideration the component-by-component reading process, from the
headline, images, and comments to the body, which is essential for modeling news
with more granularity. To this end, we propose Emulating the
behaviors of readers (Ember), an approach for fake news detection on social media
that incorporates readers' reading and verification processes to model news
thoroughly from the component perspective. Specifically, we first construct
intra-component feature extractors to emulate semantic
analysis of each component. Then, we design a module that comprises
inter-component feature extractors and a sequence-based aggregator. This module
mimics the process of verifying the correlation between components and the
overall reading and verification sequence. Thus, Ember can handle the news with
various components by emulating corresponding sequences. We conduct extensive
experiments on nine real-world datasets, and the results demonstrate the
superiority of Ember.
comment: 12 pages
☆ Learning to Rank in Generative Retrieval
Generative retrieval is a promising new paradigm in text retrieval that
generates identifier strings of relevant passages as the retrieval target. It
leverages powerful generation models and is distinct from traditional
learning-to-rank methods. However, despite its rapid
development, current generative retrieval methods are still limited. They
typically rely on a heuristic function to transform predicted identifiers into
a passage rank list, which creates a gap between the learning objective of
generative retrieval and the desired passage ranking target. Moreover, the
inherent exposure bias problem of text generation also persists in generative
retrieval. To address these issues, we propose a novel framework, called LTRGR,
that combines generative retrieval with the classical learning-to-rank
paradigm. Our approach involves training an autoregressive model using a
passage rank loss, which directly optimizes the autoregressive model toward the
optimal passage ranking. This framework only requires an additional training
step to enhance current generative retrieval systems and does not add any
burden to the inference stage. We conducted experiments on three public
datasets, and our results demonstrate that LTRGR achieves state-of-the-art
performance among generative retrieval methods, indicating its effectiveness
and robustness.
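The abstract does not state the exact form of the passage rank loss. As a rough illustration only, the sketch below uses a pairwise margin loss, one standard learning-to-rank objective; the function name, the `margin` parameter, and the toy scores are all hypothetical and are not taken from LTRGR.

```python
def pairwise_margin_rank_loss(pos_score, neg_scores, margin=1.0):
    """Mean hinge loss pushing a relevant passage above each irrelevant one.

    pos_score:  model score of the relevant passage (assumed to be derived
                from the generated identifiers; the derivation is not shown).
    neg_scores: scores of irrelevant passages.
    margin:     hypothetical hyperparameter, not from the abstract.
    """
    losses = [max(0.0, margin - (pos_score - neg)) for neg in neg_scores]
    return sum(losses) / len(losses)


# The loss is zero once the relevant passage leads every negative by the margin.
print(pairwise_margin_rank_loss(2.0, [0.5, 1.5]))
print(pairwise_margin_rank_loss(3.0, [0.5, 1.5]))
```

Because a loss of this kind only affects training, it is consistent with the abstract's claim that the inference stage is left unchanged.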
☆ Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation
Haitao Tang, Yu Fu, Lei Sun, Jiabin Xue, Dan Liu, Yongchao Li, Zhiqiang Ma, Minghui Wu, Jia Pan, Genshun Wan, Ming'en Zhao
Transducer is one of the mainstream frameworks for streaming speech
recognition. There is a performance gap between the streaming and non-streaming
transducer models due to limited context. To reduce this gap, an effective way
is to ensure that their hidden and output distributions are consistent, which
can be achieved by hierarchical knowledge distillation. However, it is
difficult to ensure the distribution consistency simultaneously because the
learning of the output distribution depends on the hidden one. In this paper,
we propose an adaptive two-stage knowledge distillation method consisting of
hidden layer learning and output layer learning. In the former stage, we learn
hidden representations with full context by applying a mean squared error loss
function. In the latter stage, we design a power-transformation-based adaptive
smoothness method to learn a stable output distribution. The method achieves a
19% relative reduction in word error rate and a faster response for the first
token compared with the original streaming model on the LibriSpeech corpus.
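The abstract gives no formula for the power transformation. The sketch below shows one generic way a power transformation can smooth a probability distribution: raise each probability to an exponent and renormalize. The name `power_smooth`, the exponent `gamma`, and the toy distribution are assumptions for illustration; the adaptive choice of the exponent described in the paper is not modeled here.

```python
def power_smooth(probs, gamma):
    """Raise each probability to the power gamma and renormalize.

    With 0 < gamma < 1 the distribution becomes flatter; gamma = 1 leaves it
    unchanged. `gamma` is a hypothetical parameter name.
    """
    powered = [p ** gamma for p in probs]
    total = sum(powered)
    return [q / total for q in powered]


teacher = [0.90, 0.07, 0.03]           # toy output distribution
smoothed = power_smooth(teacher, 0.5)  # flatter target, same ranking of classes
print(smoothed)
```

A flatter target keeps the ranking of the classes while reducing the dominance of the top class, which is one plausible reading of "stable output distribution".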
☆ DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization ACL2023
Adversarial training is one of the best-performing methods in improving the
robustness of deep language models. However, robust models come at the cost of
high time consumption, as they require multi-step gradient ascents or word
substitutions to obtain adversarial samples. In addition, these generated
samples are deficient in grammatical quality and semantic consistency, which
impairs the effectiveness of adversarial training. To address these problems,
we introduce a novel, effective procedure that instead performs adversarial training with
only clean data. Our procedure, distribution shift risk minimization (DSRM),
estimates the adversarial loss by perturbing the input data's probability
distribution rather than their embeddings. This formulation results in a robust
model that minimizes the expected global loss under adversarial attacks. Our
approach requires zero adversarial samples for training and reduces time
consumption by up to 70% compared to current best-performing adversarial
training methods. Experiments demonstrate that DSRM considerably improves
BERT's resistance to textual adversarial attacks and achieves state-of-the-art
robust accuracy on various benchmarks.
comment: Accepted by ACL2023
☆ YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
Machine learning for sign languages is bottlenecked by data. In this paper,
we present YouTube-ASL, a large-scale, open-domain corpus of American Sign
Language (ASL) videos and accompanying English captions drawn from YouTube.
With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as
large and has ~10x as many unique signers as the largest prior ASL dataset. We
train baseline models for ASL to English translation on YouTube-ASL and
evaluate them on How2Sign, where we achieve a new finetuned state of the art of
12.39 BLEU and, for the first time, report zero-shot results.
☆ Investigating Cross-Domain Behaviors of BERT in Review Understanding
Review score prediction requires review text understanding, a critical
real-world application of natural language processing. Due to dissimilar text
domains in product reviews, a common practice is fine-tuning BERT models upon
reviews of differing domains. However, there has not yet been an empirical
study of cross-domain behaviors of BERT models in the various tasks of product
review understanding. In this project, we investigate text classification BERT
models fine-tuned on single-domain and multi-domain Amazon review data. In our
findings, though single-domain models achieved marginally improved performance
on their corresponding domain compared to multi-domain models, multi-domain
models outperformed single-domain models when evaluated on multi-domain data,
single-domain data the single-domain model was not fine-tuned on, and on
average when considering all tests. Though slight increases in accuracy can be
achieved through single-domain model fine-tuning, computational resources and
costs can be reduced by utilizing multi-domain models that perform well across
domains.
comment: 9 pages, 1 figure, 2 tables
♻ ☆ When Does Translation Require Context? A Data-driven, Multilingual Exploration ACL2023
Although proper handling of discourse significantly contributes to the
quality of machine translation (MT), these improvements are not adequately
measured in common translation quality metrics. Recent works in context-aware
MT attempt to target a small set of discourse phenomena during evaluation,
but not in a fully systematic way. In this paper, we develop the
Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers that
identify and evaluate model performance on discourse phenomena in any given
dataset. The choice of phenomena is inspired by a novel methodology to
systematically identify translations requiring context. We confirm the
difficulty of previously studied phenomena while uncovering others that were
previously unaddressed. We find that common context-aware MT models make only
marginal improvements over context-agnostic models, which suggests these models
do not handle these ambiguities effectively. We release code and data for 14
language pairs to encourage the MT community to focus on accurately capturing
discourse phenomena.
comment: Accepted at ACL2023
♻ ☆ Constructing Word-Context-Coupled Space Aligned with Associative Knowledge Relations for Interpretable Language Modeling ACL 2023
As the foundation of current natural language processing methods, pre-trained
language models have achieved excellent performance. However, the black-box
structure of the deep neural network in pre-trained language models seriously
limits the interpretability of the language modeling process. After revisiting
the coupled requirement of deep neural representation and semantics logic of
language modeling, a Word-Context-Coupled Space (W2CSpace) is proposed by
introducing the alignment processing between uninterpretable neural
representation and interpretable statistical logic. Moreover, a clustering
process is also designed to connect the word- and context-level semantics.
Specifically, an associative knowledge network (AKN), considered interpretable
statistical logic, is introduced in the alignment process for word-level
semantics. Furthermore, the context-relative distance is employed as the
semantic feature for the downstream classifier, which is greatly different from
the current uninterpretable semantic representations of pre-trained models. Our
experiments for performance evaluation and interpretable analysis are executed
on several types of datasets, including SIGHAN, Weibo, and ChnSenti. We also
propose a novel evaluation strategy for the interpretability of machine learning
models. According to the experimental results, our language model achieves
better performance and highly credible interpretability compared to related
state-of-the-art methods.
comment: Accepted at ACL 2023, Findings
♻ ☆ Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test
Automatic speech recognition systems based on deep learning are mainly
trained under empirical risk minimization (ERM). Since ERM utilizes the
averaged performance on the data samples regardless of a group such as healthy
or dysarthric speakers, ASR systems are unaware of the performance disparities
across the groups. This results in biased ASR systems whose performance
differences among groups are severe. In this study, we aim to improve the ASR
system in terms of group robustness for dysarthric speakers. To achieve our
goal, we present a novel approach, sample reweighting with sample affinity test
(Re-SAT). Re-SAT systematically measures the debiasing helpfulness of the given
data sample and then mitigates the bias by debiasing helpfulness-based sample
reweighting. Experimental results demonstrate that Re-SAT contributes to
improved ASR performance on dysarthric speech without performance degradation
on healthy speech.
comment: Accepted by Interspeech 2023
♻ ☆ Language Models are Bounded Pragmatic Speakers ICML 2023
How do language models "think"? This paper formulates a probabilistic
cognitive model called the bounded pragmatic speaker, which can characterize
the operation of different variations of language models. Specifically, we
demonstrate that large language models fine-tuned with reinforcement learning
from human feedback (Ouyang et al., 2022) embody a model of thought that
conceptually resembles a fast-and-slow model (Kahneman, 2011), which
psychologists have attributed to humans. We discuss the limitations of
reinforcement learning from human feedback as a fast-and-slow model of thought
and propose avenues for expanding this framework. In essence, our research
highlights the value of adopting a cognitive probabilistic modeling approach to
gain insights into the comprehension, evaluation, and advancement of language
models.
comment: Proceedings of the First Workshop on Theory of Mind in Communicating
Agents (TOM @ ICML 2023)
♻ ☆ Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Image captioning aims to describe visual content in natural language. As 'a
picture is worth a thousand words', there could be various correct descriptions
for an image. However, with maximum likelihood estimation as the training
objective, the captioning model is penalized whenever its prediction does not
match the label. For instance, when the model predicts a word expressing richer
semantics than the label, it will be penalized and optimized to prefer more
concise expressions, referred to as conciseness optimization. In contrast,
predictions that are more concise than labels lead to richness optimization.
Such conflicting optimization directions could eventually result in the model
generating general descriptions. In this work, we introduce Semipermeable
MaxImum Likelihood Estimation (SMILE), which allows richness optimization while
blocking conciseness optimization, thus encouraging the model to generate
longer captions with more details. Extensive experiments on two mainstream
image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE
significantly enhances the descriptiveness of generated captions. We further
provide in-depth investigations to facilitate a better understanding of how
SMILE works.
♻ ☆ BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, Thomas Wolf
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
♻ ☆ What Do Compressed Multilingual Machine Translation Models Forget? EMNLP 2022
Alireza Mohammadshahi, Vassilina Nikoulina, Alexandre Berard, Caroline Brun, James Henderson, Laurent Besacier
Recently, very large pre-trained models have achieved state-of-the-art results in
various natural language processing (NLP) tasks, but their size makes it more
challenging to apply them in resource-constrained environments. Compression
techniques make it possible to drastically reduce the size of the models and
therefore their inference time with negligible impact on top-tier metrics. However, the
general performance averaged across multiple tasks and/or languages may hide a
drastic performance drop on under-represented features, which could result in
the amplification of biases encoded by the models. In this work, we assess the
impact of compression methods on Multilingual Neural Machine Translation models
(MNMT) for various language groups, gender, and semantic biases by extensive
analysis of compressed models on different machine translation benchmarks, i.e.
FLORES-101, MT-Gender, and DiBiMT. We show that the performance of
under-represented languages drops significantly, while the average BLEU metric
only slightly decreases. Interestingly, the removal of noisy memorization with
compression leads to a significant improvement for some medium-resource
languages. Finally, we demonstrate that compression amplifies intrinsic gender
and semantic biases, even in high-resource languages. Code:
https://github.com/alirezamshi/bias-compressedMT
comment: Accepted to Findings of EMNLP 2022, presented at WMT 2022
♻ ☆ Kosmos-2: Grounding Multimodal Large Language Models to the World
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and
grounding text to the visual world. Specifically, we represent referring
expressions as links in Markdown, i.e., "[text span](bounding boxes)", where
object descriptions are sequences of location tokens. Together with multimodal
corpora, we construct large-scale data of grounded image-text pairs (called
GrIT) to train the model. In addition to the existing capabilities of MLLMs
(e.g., perceiving general modalities, following instructions, and performing
in-context learning), Kosmos-2 integrates the grounding capability into
downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
including (i) multimodal grounding, such as referring expression comprehension,
and phrase grounding, (ii) multimodal referring, such as referring expression
generation, (iii) perception-language tasks, and (iv) language understanding
and generation. This work lays out the foundation for the development of
Embodiment AI and sheds light on the big convergence of language, multimodal
perception, action, and world modeling, which is a key step toward artificial
general intelligence. Data, demo, and pretrained models are available at
https://aka.ms/kosmos-2.
comment: 20 pages
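To make the "[text span](bounding boxes)" representation concrete, the sketch below turns a normalized bounding box into two location tokens and wraps a text span in a Markdown-style link. The token spelling `<loc_k>`, the 32 x 32 grid, and the function names are assumptions made for illustration; the abstract only states that object descriptions are sequences of location tokens.

```python
def location_token(x, y, bins=32):
    """Map a normalized point (x, y in [0, 1]) to a hypothetical grid token."""
    col = min(int(x * bins), bins - 1)  # clamp so x == 1.0 stays on the grid
    row = min(int(y * bins), bins - 1)
    return f"<loc_{row * bins + col}>"


def grounded_span(text, box, bins=32):
    """Format a text span and its box (x0, y0, x1, y1) as a Markdown-style link."""
    x0, y0, x1, y1 = box
    top_left = location_token(x0, y0, bins)
    bottom_right = location_token(x1, y1, bins)
    return f"[{text}]({top_left}{bottom_right})"


print(grounded_span("a snowman", (0.25, 0.5, 0.75, 1.0)))
```

Encoding boxes as discrete tokens lets grounded text be trained and generated with the same next-token objective as ordinary text.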
♻ ☆ mCPT at SemEval-2023 Task 3: Multilingual Label-Aware Contrastive Pre-Training of Transformers for Few- and Zero-shot Framing Detection SemEval'23
This paper presents the winning system for the zero-shot Spanish framing
detection task, which also achieves competitive places in eight additional
languages. The challenge of the framing detection task lies in identifying a
set of 14 frames when only a few or zero samples are available, i.e., a
multilingual multi-label few- or zero-shot setting. Our developed solution
employs a pre-training procedure based on multilingual Transformers using a
label-aware contrastive loss function. In addition to describing the system, we
perform an embedding space analysis and ablation study to demonstrate how our
pre-training procedure supports framing detection to advance computational
framing analysis.
comment: Accepted for publication at SemEval'23
♻ ☆ Max-Margin Token Selection in Attention Mechanism
Attention mechanism is a central component of the transformer architecture
which led to the phenomenal success of large language models. However, the
theoretical principles underlying the attention mechanism are poorly
understood, especially its nonconvex optimization dynamics. In this work, we
explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle
\boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where
$\boldsymbol{X}$ is the token sequence and
$(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. We
prove that running gradient descent on $\boldsymbol{p}$, or equivalently
$\boldsymbol{W}$, converges in direction to a max-margin solution that
separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly
formalizes attention as an optimal token selection mechanism. Remarkably, our
results are applicable to general data and precisely characterize
$\textit{optimality}$ of tokens in terms of the value embeddings
$\boldsymbol{Xv}$ and problem geometry. We also provide a broader
regularization path analysis that establishes the margin maximizing nature of
attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$
and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions
under which the regularization paths directionally converge to their respective
hard-margin SVM solutions where $\boldsymbol{v}$ separates the input features
based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$
is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we
verify our theoretical findings via numerical experiments and provide insights.
comment: minor edits and title change
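The model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$ can be evaluated directly on a toy input. The sketch below does not run gradient descent or reproduce the paper's proof; it only shows that scaling $\boldsymbol{p}$ along a fixed direction makes the softmax weights concentrate on a single token, which is the intuition behind convergence "in direction". All numeric values and helper names are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def attention_output(X, v, W, p):
    """Return f(X) = <Xv, softmax(XWp)> together with the attention weights."""
    values = matvec(X, v)             # Xv: one scalar value per token
    scores = matvec(X, matvec(W, p))  # XWp: one attention score per token
    weights = softmax(scores)
    return sum(val * w for val, w in zip(values, weights)), weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, two features (toy data)
v = [1.0, -1.0]
W = [[1.0, 0.0], [0.0, 1.0]]
direction = [1.0, -1.0]                   # fixed direction for p

for scale in (1.0, 10.0, 100.0):
    p = [scale * c for c in direction]
    out, weights = attention_output(X, v, W, p)
    print(scale, out, weights)
```

As the norm of $\boldsymbol{p}$ grows, the weight on the first token approaches one, so the output approaches that token's value.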
♻ ☆ Auditing large language models: a three-layered approach
Large language models (LLMs) represent a major advance in artificial
intelligence (AI) research. However, the widespread use of LLMs is also coupled
with significant ethical and social challenges. Previous research has pointed
towards auditing as a promising governance mechanism to help ensure that AI
systems are designed and deployed in ways that are ethical, legal, and
technically robust. However, existing auditing procedures fail to address the
governance challenges posed by LLMs, which display emergent capabilities and
are adaptable to a wide range of downstream tasks. In this article, we address
that gap by outlining a novel blueprint for how to audit LLMs. Specifically, we
propose a three-layered approach, whereby governance audits (of technology
providers that design and disseminate LLMs), model audits (of LLMs after
pre-training but prior to their release), and application audits (of
applications based on LLMs) complement and inform each other. We show how
audits, when conducted in a structured and coordinated manner on all three
levels, can be a feasible and effective mechanism for identifying and managing
some of the ethical and social risks posed by LLMs. However, it is important to
remain realistic about what auditing can reasonably be expected to achieve.
Therefore, we discuss the limitations not only of our three-layered approach
but also of the prospect of auditing LLMs at all. Ultimately, this article
seeks to expand the methodological toolkit available to technology providers
and policymakers who wish to analyse and evaluate LLMs from technical, ethical,
and legal perspectives.
comment: 22 pages, 2 figures. AI Ethics (2023)
♻ ☆ Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering
There has been a growing effort to replace hand extraction of data from
research papers with automated data extraction based on natural language
processing, language models, and recently, large language models (LLMs).
Although these methods enable efficient extraction of data from large sets of
research papers, they require a significant amount of up-front effort,
expertise, and coding. In this work we propose the ChatExtract method that can
fully automate very accurate data extraction with minimal initial effort and
background, using an advanced conversational LLM. ChatExtract consists of a set
of engineered prompts applied to a conversational LLM that identify
sentences with data, extract that data, and assure the data's correctness
through a series of follow-up questions. These follow-up questions largely
overcome known issues with LLMs providing factually inaccurate responses.
ChatExtract can be applied with any conversational LLM and yields very
high-quality data extraction. In tests on materials data we find precision and
recall both close to 90% from the best conversational LLMs, like ChatGPT-4. We
demonstrate that the exceptional performance is enabled by the information
retention in a conversational model combined with purposeful redundancy and
introducing uncertainty through follow-up prompts. These results suggest that
approaches similar to ChatExtract, due to their simplicity, transferability,
and accuracy are likely to become powerful tools for data extraction in the
near future. Finally, databases for critical cooling rates of metallic glasses
and yield strengths of high entropy alloys are developed using ChatExtract.
comment: 7 pages, 2 figures, 1 table
♻ ☆ Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation ICML 2023
In this paper, we introduce a data-driven approach for Formality-Sensitive
Machine Translation (FSMT) that caters to the unique linguistic properties of
four target languages. Our methodology centers on two core strategies: 1)
language-specific data handling, and 2) synthetic data generation using
large-scale language models and empirical prompt engineering. This approach
demonstrates a considerable improvement over the baseline, highlighting the
effectiveness of data-centric techniques. Our prompt engineering strategy
further improves performance by producing superior synthetic translation
examples.
comment: Accepted for Data-centric Machine Learning Research (DMLR) Workshop
at ICML 2023
♻ ☆ Cross-Attention is Not Enough: Incongruity-Aware Hierarchical Multimodal Sentiment Analysis and Emotion Recognition
Fusing multiple modalities for affective computing tasks has proven effective
for performance improvement. However, how multimodal fusion works is not well
understood, and its use in the real world usually results in large model sizes.
In this work, on sentiment and emotion analysis, we first analyze how the
salient affective information in one modality can be affected by the other in
crossmodal attention. We find that inter-modal incongruity exists at the latent
level due to crossmodal attention. Based on this finding, we propose a
lightweight model via Hierarchical Crossmodal Transformer with Modality Gating
(HCT-MG), which determines a primary modality according to its contribution to
the target task and then hierarchically incorporates auxiliary modalities to
alleviate inter-modal incongruity and reduce information redundancy. The
experimental evaluation on three benchmark datasets: CMU-MOSI, CMU-MOSEI, and
IEMOCAP verifies the efficacy of our approach, showing that it: 1) achieves
better performance than prior work as well as manual selection of the primary
modality; 2) can recognize hard samples whose emotions are hard to tell; 3)
mitigates the inter-modal incongruity at the latent level when modalities have
mismatched affective tendencies; 4) reduces model size to less than 1M
parameters while outperforming existing models of similar sizes.
comment: *Equal contribution
♻ ☆ Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models ACL 2023
Reasoning about time is of fundamental importance. Many facts are
time-dependent. For example, athletes change teams from time to time, and
different government officials are elected periodically. Previous
time-dependent question answering (QA) datasets tend to be biased in either
their coverage of time spans or question types. In this paper, we introduce a
comprehensive probing dataset TempReason to evaluate the temporal reasoning
capability of large language models. Our dataset includes questions of three
temporal reasoning levels. In addition, we also propose a novel learning
framework to improve the temporal reasoning capability of large language
models, based on temporal span extraction and time-sensitive reinforcement
learning. We conducted experiments in closed book QA, open book QA, and
reasoning QA settings and demonstrated the effectiveness of our approach. Our
code and data are released on https://github.com/DAMO-NLP-SG/TempReason.
comment: ACL 2023
♻ ☆ Neural Topic Modeling with Continual Lifelong Learning ICML2020
Lifelong learning has recently attracted attention in building machine
learning systems that continually accumulate and transfer knowledge to help
future learning. Unsupervised topic modeling has been popularly used to
discover topics from document collections. However, applying topic modeling
is challenging under data sparsity, e.g., in a small collection of (short)
documents, where it generates incoherent topics and sub-optimal document
representations. To address the problem, we propose a lifelong learning
framework for neural topic modeling that can continuously process streams of
document collections, accumulate topics and guide future topic modeling tasks
by knowledge transfer from several sources to better deal with the sparse data.
In the lifelong process, we particularly investigate jointly: (1) sharing
generative homologies (latent topics) over lifetime to transfer prior
knowledge, and (2) minimizing catastrophic forgetting to retain the past
learning via novel selective data augmentation, co-training and topic
regularization approaches. Given a stream of document collections, we apply the
proposed Lifelong Neural Topic Modeling (LNTM) framework to model three
sparse document collections as future tasks and demonstrate improved
performance as measured by perplexity, topic coherence, and an information
retrieval task.
comment: Accepted at ICML2020 (13 pages, 11 figures, 9 tables)
♻ ☆ Explainable and Discourse Topic-aware Neural Language Understanding ICML2020
Marrying topic models and language models exposes language understanding to a
broader source of document-level context beyond sentences via topics. While
introducing topical semantics in language models, existing approaches
incorporate latent document topic proportions and ignore topical discourse in
sentences of the document. This work extends the line of research by
additionally introducing an explainable topic representation in language
understanding, obtained from a set of key terms for each latent topic in the
proportion. Moreover, we retain sentence-topic associations along
with document-topic association by modeling topical discourse for every
sentence in the document. We present a novel neural composite language model
that exploits both the latent and explainable topics along with topical
discourse at sentence-level in a joint learning framework of topic and language
models. Experiments over a range of tasks such as language modeling, word sense
disambiguation, document classification, retrieval and text generation
demonstrate the ability of the proposed model to improve language understanding.
comment: Accepted at ICML2020 (13 pages, 2 figures)
♻ ☆ Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers
Despite the recent progress in speech emotion recognition (SER),
state-of-the-art systems are unable to achieve improved performance in
cross-language settings. In this paper, we propose a Multimodal Dual Attention
Transformer (MDAT) model to improve cross-language SER. Our model utilises
pre-trained models for multimodal feature extraction and is equipped with a
dual attention mechanism including graph attention and co-attention to capture
complex dependencies across different modalities and achieve improved
cross-language SER results using minimal target language data. Our model
also exploits a transformer encoder layer for high-level feature
representation to improve emotion classification accuracy. In this way, MDAT
refines feature representations at various stages and provides
emotionally salient features to the classification layer. This novel approach
also ensures the preservation of modality-specific emotional information while
enhancing cross-modality and cross-language interactions. We assess our model's
performance on four publicly available SER datasets and establish its superior
effectiveness compared to recent approaches and baseline models.
comment: Under Review IEEE TMM
♻ ☆ SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
As the size of large language models continues to scale, so do the
computational resources required to run them. Spiking Neural Networks (SNNs) have
emerged as an energy-efficient approach to deep learning that leverages sparse
and event-driven activations to reduce the computational overhead associated
with model inference. While they have become competitive with non-spiking
models on many computer vision tasks, SNNs have also proven to be more
challenging to train. As a result, their performance lags behind modern deep
learning, and we are yet to see the effectiveness of SNNs in language
generation. In this paper, inspired by the Receptance Weighted Key Value (RWKV)
language model, we successfully implement 'SpikeGPT', a generative language
model with binary, event-driven spiking activation units. We train two
variants of the proposed model, with 45M and 216M parameters. To the best of our
knowledge, SpikeGPT is the largest backpropagation-trained SNN model to date,
rendering it suitable for both the generation and comprehension of natural
language. We achieve this by modifying the transformer block to replace
multi-head self-attention, reducing computational complexity from quadratic,
O(N^2), to linear, O(N), in the sequence length. Input tokens are
instead streamed in sequentially to our attention mechanism (as with typical
SNNs). Our preliminary experiments show that SpikeGPT remains competitive with
non-spiking models on tested benchmarks, while using 20x fewer operations
when processed on neuromorphic hardware that can leverage sparse, event-driven
activations.
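The move from O(N^2) to O(N) described in the abstract above can be illustrated with a toy token mixer. The sketch below is an assumption-laden illustration, not SpikeGPT's or RWKV's actual formulation: the function names, the scalar per-token keys and values, the single decay parameter, and the simple threshold standing in for a spiking neuron are all hypothetical. It shows that a decayed weighted average over past tokens can be computed either by re-scanning the history at every step or by carrying a small running state while tokens are streamed in one at a time.

```python
import math
from typing import List, Sequence


def mix_quadratic(keys: Sequence[float], values: Sequence[float], decay: float) -> List[float]:
    """O(N^2) reference: every position re-scans all earlier positions."""
    out = []
    for t in range(len(keys)):
        num = 0.0
        den = 0.0
        for i in range(t + 1):
            # Older tokens are down-weighted exponentially with their distance.
            w = math.exp(keys[i] - decay * (t - i))
            num += w * values[i]
            den += w
        out.append(num / den)
    return out


def mix_streaming(keys: Sequence[float], values: Sequence[float], decay: float) -> List[float]:
    """O(N) version: tokens are streamed in and only a running state is kept."""
    out = []
    num = 0.0
    den = 0.0
    shrink = math.exp(-decay)  # one step of decay applied to the whole state
    for k, v in zip(keys, values):
        w = math.exp(k)
        num = shrink * num + w * v
        den = shrink * den + w
        out.append(num / den)
    return out


def spike(xs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Binary activation (hypothetical stand-in for a spiking unit)."""
    return [1 if x >= threshold else 0 for x in xs]


keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 0.0, 1.0, 0.0]
slow = mix_quadratic(keys, values, decay=0.5)
fast = mix_streaming(keys, values, decay=0.5)
print(max(abs(a - b) for a, b in zip(slow, fast)))  # difference between the two versions
print(spike(fast))
```

Both functions compute the same quantity; the streaming version does constant work per token because the decayed sums are updated in place rather than recomputed.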
♻ ☆ WACO: Word-Aligned Contrastive Learning for Speech Translation ACL 2023
End-to-end Speech Translation (E2E ST) aims to directly translate source
speech into target text. Existing ST methods perform poorly when only an
extremely small amount of speech-text data is available for training. We observe that an ST
model's performance closely correlates with its embedding similarity between
speech and source transcript. In this paper, we propose Word-Aligned
COntrastive learning (WACO), a simple and effective method for extremely
low-resource speech-to-text translation. Our key idea is bridging word-level
representations for both speech and text modalities via contrastive learning.
We evaluate WACO and other methods on the MuST-C dataset, a widely used ST
benchmark, and on a low-resource direction Maltese-English from IWSLT 2023. Our
experiments demonstrate that WACO outperforms the best baseline by 9+ BLEU
points with only 1-hour parallel ST data. Code is available at
https://github.com/owaski/WACO.
comment: ACL 2023 Poster
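The word-level contrastive idea in the abstract above can be sketched as an InfoNCE-style loss. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each word already has one pooled speech embedding and one text embedding, uses cosine similarity with a temperature, and treats the other words in the same batch as negatives. The function names and the temperature value are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def word_contrastive_loss(
    speech_words: Sequence[Sequence[float]],
    text_words: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Average InfoNCE loss; speech_words[i] and text_words[i] are the same word."""
    if not speech_words or len(speech_words) != len(text_words):
        raise ValueError("need the same non-zero number of speech and text words")
    total = 0.0
    for i, s in enumerate(speech_words):
        logits = [cosine(s, t) / temperature for t in text_words]
        # Numerically stable log-sum-exp over all candidate text words.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(speech_words)


speech = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]
print(word_contrastive_loss(speech, text_aligned))
print(word_contrastive_loss(speech, text_swapped))
```

The loss is small when each speech word is closest to its own text word and grows when the pairing is wrong, which is the signal that pulls the two modalities together at the word level.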
♻ ☆ Survey on Sociodemographic Bias in Natural Language Processing
Deep neural networks often learn unintended biases during training, which
might have harmful effects when deployed in real-world settings. This paper
surveys 209 papers on bias in NLP models, most of which address
sociodemographic bias. To better understand the distinction between bias and
real-world harm, we turn to ideas from psychology and behavioral economics to
propose a definition for sociodemographic bias. We identify three main
categories of NLP bias research: types of bias, quantifying bias, and
debiasing. We conclude that current approaches to quantifying bias face
reliability issues, that many of the bias metrics do not relate to real-world
biases, and that current debiasing techniques are superficial and hide bias
rather than removing it. Finally, we provide recommendations for future work.
comment: 23 pages, 1 figure
♻ ☆ InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback
Humans write code in a fundamentally interactive manner and rely on constant
execution feedback to correct errors, resolve ambiguities, and decompose tasks.
While LLMs have recently exhibited promising coding capabilities, current
coding benchmarks mostly consider a static instruction-to-code sequence
transduction process, which has the potential for error propagation and a
disconnect between the generated code and its final execution environment. To
address this gap, we introduce InterCode, a lightweight, flexible, and
easy-to-use framework of interactive coding as a standard reinforcement
learning (RL) environment, with code as actions and execution feedback as
observations. Our framework is language and platform agnostic, uses
self-contained Docker environments to provide safe and reproducible execution,
and is compatible out-of-the-box with traditional seq2seq coding methods, while
enabling the development of new methods for interactive code generation. We use
InterCode to create two interactive code environments with Bash and SQL as
action spaces, leveraging data from the static Spider and NL2Bash datasets. We
demonstrate InterCode's viability as a testbed by evaluating multiple
state-of-the-art LLMs configured with different prompting strategies such as
ReAct and Plan & Solve. Our results showcase the benefits of interactive code
generation and demonstrate that InterCode can serve as a challenging benchmark
for advancing code understanding and generation capabilities. InterCode is
designed to be easily extensible and can even be used to incorporate new tasks
such as Capture the Flag, a popular coding puzzle that is inherently multi-step
and involves multiple programming languages. Project site with code and data:
https://intercode-benchmark.github.io
♻ ☆ Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models ICML 2023
We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model
pre-trained jointly on unlabeled speech, unlabeled text and supervised data
spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST)
and Machine Translation (MT), in over 100 languages. By leveraging a quantized
representation of speech as a target, Mu$^{2}$SLAM trains the speech-text
models with a sequence-to-sequence masked denoising objective similar to T5 on
the decoder and a masked language modeling (MLM) objective on the encoder, for
both unlabeled speech and text, while utilizing the supervised tasks to improve
cross-lingual and cross-modal representation alignment within the model. On
CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained
on public datasets, improving on xx-en translation over the previous best by
1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR,
our model matches the performance of an mSLAM model fine-tuned with an RNN-T
decoder, despite using a relatively weaker sequence-to-sequence architecture.
On text understanding tasks, our model improves by more than 6% over mSLAM on
XNLI, getting closer to the performance of mT5 models of comparable capacity on
XNLI and TydiQA, paving the way towards a single model for all speech and text
understanding tasks.
comment: ICML 2023
♻ ☆ Blank Collapse: Compressing CTC emission for the faster decoding
The Connectionist Temporal Classification (CTC) model is a very efficient method
for modeling sequences, especially for speech data. To use a CTC model for an
Automatic Speech Recognition (ASR) task, beam search decoding with an
external language model such as an n-gram LM is necessary to obtain reasonable
results. In this paper, we analyze the blank label in CTC beam search in depth and
propose a very simple method that reduces the amount of computation, resulting in
faster beam search decoding. With this method, we obtain up to 78%
faster decoding than ordinary beam search decoding with a very small loss
of accuracy on the LibriSpeech datasets. We show that this method is effective not only
practically by experiments but also theoretically by mathematical reasoning. We
also observe that this reduction is more pronounced when the accuracy of the model
is higher.
comment: Accepted in Interspeech 2023
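One way to picture compressing CTC emissions before decoding is sketched below. The abstract does not spell out the rule, so everything here is an assumption: the function names, the 0.99 blank-probability threshold, and the choice to keep exactly one frame from each run of blank-dominated frames (so that repeated labels separated by a blank stay distinct). Greedy decoding is used only to keep the example short; the paper targets beam search.

```python
from typing import List, Sequence


def collapse_blank_frames(
    emissions: Sequence[Sequence[float]],
    blank_id: int = 0,
    threshold: float = 0.99,
) -> List[List[float]]:
    """Keep one frame per run of consecutive blank-dominated frames (assumed rule)."""
    kept: List[List[float]] = []
    in_blank_run = False
    for frame in emissions:
        is_blank = frame[blank_id] >= threshold
        if is_blank and in_blank_run:
            continue  # drop a repeated blank-dominated frame
        kept.append(list(frame))
        in_blank_run = is_blank
    return kept


def greedy_ctc_decode(emissions: Sequence[Sequence[float]], blank_id: int = 0) -> List[int]:
    """Standard greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    labels: List[int] = []
    prev = None
    for frame in emissions:
        best = max(range(len(frame)), key=lambda i: frame[i])
        if best != prev and best != blank_id:
            labels.append(best)
        prev = best
    return labels


BLANK = [0.995, 0.003, 0.002]
A = [0.1, 0.8, 0.1]
B = [0.05, 0.05, 0.9]
frames = [BLANK, BLANK, A, BLANK, BLANK, BLANK, A, B, BLANK]

collapsed = collapse_blank_frames(frames)
print(len(frames), len(collapsed))
print(greedy_ctc_decode(frames), greedy_ctc_decode(collapsed))
```

In this toy input the decoder sees fewer frames after collapsing yet produces the same label sequence, which is the kind of saving the abstract attributes to handling the blank label.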